1. Detailed Comparison of Performance Results on Synthetic Data

Here, we include the performance results for the comparison of Polaris to optimization
of the BIC score and to the clairvoyant DiProg. Figures 1, 2, and 3 show the comparison
results using recall and precision as performance metrics and both small and asymptotic
sample sizes, for CMPNs, DMPNs, and XMPNs, respectively. We separated recall and precision
in order to highlight the asymmetry in Polaris's performance: Polaris performs
considerably better in recall but consistently introduces a slightly higher number of false
edges in the reconstructed graph. The asymptotic sample size is included to experimentally
verify the convergence of Polaris. Note that Theorem 1 only guarantees convergence on
graphs without transitive edges, but even with transitive edges, Polaris converges almost
completely at only 2000 samples.
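
As a reading aid for the two metrics, the following minimal sketch computes recall and precision for a reconstructed edge set. The function name `recall_precision` and the toy edge sets are illustrative assumptions, not part of the original experiments; the example mirrors the asymmetry described above, where every true edge is recovered but one false edge is added.

```python
def recall_precision(true_edges, learned_edges):
    """Return (recall, precision) of a learned edge set against the true one."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    true_positives = len(true_edges & learned_edges)
    # Recall: fraction of true edges that were recovered.
    recall = true_positives / len(true_edges) if true_edges else 1.0
    # Precision: fraction of learned edges that are true edges.
    precision = true_positives / len(learned_edges) if learned_edges else 1.0
    return recall, precision

# Hypothetical toy graphs: all true edges found, plus one false edge.
true_edges = {("A", "B"), ("B", "C"), ("A", "D")}
learned_edges = {("A", "B"), ("B", "C"), ("A", "D"), ("C", "D")}
recall, precision = recall_precision(true_edges, learned_edges)
print(recall, precision)
```

Edges are treated as directed pairs here, so a correctly placed but wrongly oriented edge counts as both a miss and a false edge.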
Detailed Performance Comparison for CMPN on Synthetic Data
Network Variables: 10
Performance Metrics: Recall, Precision
Noise Levels: 0%,5%,10%,15%,20%,25%,30%
Sample Sizes: 50,100,150,200,250,300,500,(2000)




Figure 1. The experimental performance results for Polaris, BIC, and 
clairvoyant DiProg on CMPNs, measured in terms of recall (left panels) 
and precision (right panels). To show the asymptotic behavior of the three 
algorithms, we plotted the performance for sample sizes up to 2000 (bottom 
panels). For comparison, we also included the performance on more realistic
sample sizes (top panels).
Detailed Performance Comparison for DMPN on Synthetic Data

[Figure panels: Recall (left), Precision (right)]

Network Variables: 10
Performance Metrics: Recall, Precision
Noise Levels: 0%,5%,10%,15%,20%,25%,30%
Sample Sizes: 50,100,150,200,250,300,500,(2000)



Figure 2. The experimental performance results for Polaris, BIC, and 
clairvoyant DiProg on DMPNs, measured in terms of recall (left panels) 
and precision (right panels). To show the asymptotic behavior of the three 
algorithms, we plotted the performance for sample sizes up to 2000 (bottom 
panels). For comparison, we also included the performance on more realistic
sample sizes (top panels).
Detailed Performance Comparison for XMPN on Synthetic Data
Network Variables: 10
Performance Metrics: Recall, Precision
Noise Levels: 0%,5%,10%,15%,20%,25%,30%
Sample Sizes: 50,100,150,200,250,300,500,(2000)




Figure 3. The experimental performance results for Polaris, BIC, and
clairvoyant DiProg on XMPNs, measured in terms of recall (left panels)
and precision (right panels). To show the asymptotic behavior of the three
algorithms, we plotted the performance for sample sizes up to 2000 (bottom
panels). For comparison, we also included the performance on more realistic
sample sizes (top panels).
Figure 4 demonstrates the efficacy and correctness of the $\alpha$-filter in rejecting hypotheses
prior to optimization of the score, in each of the three types of MPNs. For each type of
MPN, the average number of rejected true hypotheses is considerably smaller than one and
converges to zero for medium sample sizes. The $\alpha$-filter is particularly effective at pruning
the hypothesis space of XMPNs, rejecting approximately 1000 hypotheses on average, out
of a possible 1300 hypotheses. It is slightly less effective for CMPNs, rejecting between 500
and 1000 hypotheses. Finally, it is least effective for DMPNs, rejecting between 150 and
350 hypotheses.




[Figure panels: Efficacy (left), Error Rate (right)]

Network Variables: 10
Noise Levels: 0%,5%,10%,15%,20%,25%,30%
Sample Sizes: 50,100,150,200,250,300,500,2000



Figure 4. The $\alpha$-filter rejects hypotheses prior to optimization of the score.
The figures on the left show the efficacy, measured in terms of the number of 
hypotheses eliminated prior to optimization. The figures on the right show 
the error rate, measured in terms of the average number of true hypotheses 
rejected. 
2. Time complexity of Polaris optimization

The evaluation of Polaris scores for all hypotheses dominates the computational
complexity of our algorithm. We analyze the asymptotic complexity of this computation and
show that its parametric complexity is exponential, where the exponent is determined
by the parameter. For a fixed (in practice, small) value of the parameter, Polaris is
polynomial and tractable. To estimate the complexity, we first determine the complexity
of computing the score for any single hypothesis; then we multiply this function by the
number of hypotheses to get the total cost, which is

\[ O\left(M \cdot N^2 \cdot (N-1)^k\right). \]

Here, the parameter $k$ is the maximum number of parents for any node (and can be safely
bounded by 3, in practice), and the input size is determined by $M$ and $N$: respectively, the
number of samples and the number of variables. In practice, the $\alpha$-filter helps performance
tremendously, as it avoids the log-likelihood (LL) computation for nearly half of the
hypotheses or more (see Figure 4).

2.0.1. Computing the score for a single hypothesis. The bulk of the score computation effort
is expended in computing $\alpha$ and the LL. The $\alpha$ computation is divided into computing the
$\theta^+$'s and $\theta^-$'s, which are just the probabilities of each row in the matrix encoding the
Conditional Probability Distribution (CPD). Both computations entail counting the number of
samples that correspond to each row and thus, in total, take $O(M \cdot N)$ time. The maximum
likelihood (ML) parameters in the LL score are precisely the $\theta^+$'s and $\theta^-$'s computed for
$\alpha$. Actually computing the LL given the ML parameters requires iterating through the samples
one more time and matching each sample to its corresponding CPD row. Thus, LL computation
also takes $O(M \cdot N)$ time. Combining all, the total local score computation for one node
still takes $O(M \cdot N)$ time.
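
The counting step can be sketched for the simplest case, a hypothesis with a single candidate parent. This is a hedged illustration only: the function name `edge_alpha` and the toy data are assumptions, and the form $\alpha = (\theta^+ - \theta^-)/(\theta^+ + \theta^-)$ is inferred from the derivation in the proof of Theorem 1 rather than stated in this section.

```python
def edge_alpha(samples, parent, child):
    """Estimate theta+, theta- and alpha for one candidate edge parent -> child.

    `samples` is a list of equal-length 0/1 tuples, one tuple per sample.
    """
    n_on = n_on_hit = n_off = n_off_hit = 0
    for row in samples:                      # single pass over the M samples
        if row[parent] == 1:
            n_on += 1
            n_on_hit += row[child]
        else:
            n_off += 1
            n_off_hit += row[child]
    if n_on == 0 or n_off == 0:
        raise ValueError("parent must take both values in the sample")
    theta_plus = n_on_hit / n_on             # P(child = 1 | parent = 1)
    theta_minus = n_off_hit / n_off          # P(child = 1 | parent = 0)
    if theta_plus + theta_minus == 0:
        raise ValueError("child never occurs; alpha is undefined")
    # Assumed form of alpha, inferred from the Theorem 1 derivation.
    alpha = (theta_plus - theta_minus) / (theta_plus + theta_minus)
    return theta_plus, theta_minus, alpha

# Hypothetical data: column 0 is the candidate parent, column 1 the child.
samples = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6
theta_plus, theta_minus, alpha = edge_alpha(samples, parent=0, child=1)
print(theta_plus, theta_minus, alpha)
```

The loop touches each sample once, which is the counting cost the paragraph above refers to; with larger parent sets, each sample must additionally be matched to its CPD row.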

2.0.2. Number of hypotheses. The hypotheses corresponding to one node consist of its
possible parent sets. A node can have parent sets of size 0 to size $k$, but it cannot be its
own parent. Thus, the total number of parent sets for one node is
$\sum_{i=0}^{k} \binom{N-1}{i}$. The final term dominates the series, and thus asymptotically,
the number of hypotheses for one node is $\binom{N-1}{k} = O(N^k)$.
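
The count can be checked numerically with a short sketch; the function name `parent_sets_per_node` is an illustrative assumption. With $N = 10$ variables and $k = 3$, the bound suggested above, it gives 130 parent sets per node and 1300 in total, which agrees with the 1300 possible hypotheses quoted in the discussion of Figure 4.

```python
from math import comb

def parent_sets_per_node(n_vars, k):
    """Number of parent sets of size 0..k drawn from the other n_vars - 1 nodes."""
    return sum(comb(n_vars - 1, i) for i in range(k + 1))

n_vars, k = 10, 3
per_node = parent_sets_per_node(n_vars, k)   # 1 + 9 + 36 + 84
total = n_vars * per_node                    # one hypothesis space per node
print(per_node, total)
```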

3. Proofs of theorems on asymptotic convergence 

Next, in this section, we prove several important properties about the asymptotic per- 
formance of Polaris. The main results are summarized in Theorem 1, which defines the 
type of structures that are learnable by Polaris and the conditions under which they are 
guaranteed to be learnable. 

Lemma 1 (Convergence of $\alpha$-filter). For a sufficiently large sample size, $M$, the
$\alpha$-filter produces no false negatives for Conjunctive, Disjunctive and Exclusive Disjunctive
Monotonic Progressive Networks: CMPNs, DMPNs, and XMPNs, respectively.
Proof: 
By the law of large numbers, the empirical estimates for all rows of the CPDs will
converge to their corresponding true parameter values. To show that the $\alpha$-filter will not
create false negatives, we show that $\alpha$ for all true parent sets must be strictly positive for
all rows of the CPDs. The $\alpha$ values for positive rows are always 1 and will thus never be
negative. The $\alpha$ value for a negative row $i$ of a CPD may be negative if $\theta^+ < \theta_i^-$,
with $\theta^+$ as appropriately defined for each of the MPN types. Thus, we will show
that for all 3 types of MPNs, each negative row will have a strictly positive $\alpha$. In all three
cases, we use the fact that the conditional probability for all negative rows of all CPDs is
strictly below $\epsilon$ and that for the positive rows is strictly above $\epsilon$.

Case I: CMPN. $\theta^+ = \mathcal{P}(X = 1 \mid \sum Pa(X) = |Pa(X)|)$. Here, $\theta^+$ refers to the
conditional probability of the single positive row, which is by definition larger than
$\epsilon$, or restated, $\theta^+ - \epsilon > 0$. Combined with the fact that $\theta^- < \epsilon$, it follows
that $\theta^+ > \theta^-$ and thus, $\alpha$ will never be negative.

Case II: DMPN. $\theta^+ = \mathcal{P}(X = 1 \mid \sum Pa(X) > 0)$. The derivation below establishes that
$\theta^+$ is always strictly larger than $\epsilon$ for the true parent sets in a DMPN. The summation
in step (1) is over all values of the parents that are not all zeroes. Here, $n$ refers to the
number of parents in $Pa(X)$. That is, $n = |Pa(X)|$. The inequality in step (2) exploits
the fact that each conditional probability $\mathcal{P}(X \mid \sum Pa(X) = i)$ corresponds to a positive
row and is thus strictly larger than $\epsilon$.



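\[
\begin{aligned}
\mathcal{P}\left(X \mid \textstyle\sum Pa(X) > 0\right)
&= \frac{\mathcal{P}\left(X, \sum Pa(X) > 0\right)}{\mathcal{P}\left(\sum Pa(X) > 0\right)} \\
&= \frac{\sum_{i=1}^{n} \mathcal{P}\left(X \mid \sum Pa(X) = i\right) \mathcal{P}\left(\sum Pa(X) = i\right)}{\sum_{i=1}^{n} \mathcal{P}\left(\sum Pa(X) = i\right)} && \text{[step (1)]} \\
&> \frac{\sum_{i=1}^{n} \epsilon \, \mathcal{P}\left(\sum Pa(X) = i\right)}{\sum_{i=1}^{n} \mathcal{P}\left(\sum Pa(X) = i\right)} && \text{[step (2)]} \\
&= \epsilon.
\end{aligned}
\]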
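
The argument for Case II amounts to observing that $\theta^+$ is a weighted average of positive-row probabilities, each of which exceeds $\epsilon$. A small numeric sketch with exact fractions illustrates this; all numbers below (the threshold `eps`, the conditional probabilities, and the weights) are hypothetical values chosen only for illustration.

```python
from fractions import Fraction as F

eps = F(3, 10)                           # hypothetical threshold epsilon
# P(X = 1 | sum Pa(X) = i) for i = 1..3; each positive row exceeds eps.
cond = [F(2, 5), F(1, 2), F(9, 10)]
# P(sum Pa(X) = i) for i = 1..3; these need not sum to 1.
weight = [F(1, 2), F(1, 3), F(1, 12)]

# theta+ = P(X = 1 | sum Pa(X) > 0), written as a weighted average.
theta_plus = sum(c * w for c, w in zip(cond, weight)) / sum(weight)
print(theta_plus, theta_plus > eps)
```

Because the weights are non-negative and not all zero, the average lies between the smallest and largest positive-row probability, so it stays above the threshold.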



Case III: XMPN. $\theta^+ = \mathcal{P}(X = 1 \mid \sum Pa(X) = 1)$. The derivation below shows, just like
in the DMPN, that $\theta^+ > \epsilon$ for all true parent sets in the XMPN. The reasoning behind
the steps is similar to that above, except that the summation in step (1') is over the rows
in which exactly one parent takes value 1 and the rest take value 0. To denote this, we
use the standard notation $Pa_i(X)$ to mean the $i$-th parent of $X$ and $Pa_{-i}(X)$ to mean all
parents except for the $i$-th parent of $X$.






ILYA KORSUNSKY, DANIELE RAMAZZOTTI, GIULIO CARAVAGNA, BUD MISHRA 



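\[
\begin{aligned}
\mathcal{P}\left(X \mid \textstyle\sum Pa(X) = 1\right)
&= \frac{\mathcal{P}\left(X, \sum Pa(X) = 1\right)}{\mathcal{P}\left(\sum Pa(X) = 1\right)} \\
&= \frac{\sum_{i=1}^{n} \mathcal{P}\left(X, Pa_i(X) = 1, Pa_{-i}(X) = 0\right)}{\sum_{i=1}^{n} \mathcal{P}\left(Pa_i(X) = 1, Pa_{-i}(X) = 0\right)} && \text{[step (1')]} \\
&= \frac{\sum_{i=1}^{n} \left[ \mathcal{P}\left(X \mid Pa_i(X) = 1, Pa_{-i}(X) = 0\right) \mathcal{P}\left(Pa_i(X) = 1, Pa_{-i}(X) = 0\right) \right]}{\sum_{i=1}^{n} \mathcal{P}\left(Pa_i(X) = 1, Pa_{-i}(X) = 0\right)} \\
&> \frac{\sum_{i=1}^{n} \epsilon \, \mathcal{P}\left(Pa_i(X) = 1, Pa_{-i}(X) = 0\right)}{\sum_{i=1}^{n} \mathcal{P}\left(Pa_i(X) = 1, Pa_{-i}(X) = 0\right)} \\
&= \epsilon \, \frac{\sum_{i=1}^{n} \mathcal{P}\left(Pa_i(X) = 1, Pa_{-i}(X) = 0\right)}{\sum_{i=1}^{n} \mathcal{P}\left(Pa_i(X) = 1, Pa_{-i}(X) = 0\right)} = \epsilon.
\end{aligned}
\]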

Lemma 2 (Consistency of Polaris). Polaris is a statistically consistent score. 
Proof: 

Let $M$ be the number of samples generated by the graph $G^* = (V, E^*)$. Let $G = (V, E)$
be the graph learned by maximizing the Polaris score, and $G_{BIC}$ be the graph learned
by maximizing the BIC score, both for a sufficiently large $M$. The Polaris score consists
of three terms: the log-likelihood (LL) term and the regularization term from BIC, and
the monotonicity term. Each of these terms grows at a different rate. The LL term grows
linearly ($O(M)$) with the number of samples. The regularization term grows logarithmically
($O(\log M)$). The monotonicity term does not grow ($O(1)$), since the sum of $\alpha$ scores
($\sum_{d \in D} \alpha_d$) grows linearly with the number of samples, $M$, but it is weighted by $1/M$.
Consequently, it is subsumed by the other two terms. Thus, any perturbation to the
graph $G$ that would increase the monotonicity score but decrease the BIC score would
also decrease the Polaris score. From the consistency theorem for BIC, we know that any
perturbation to the undirected skeleton or v-structures of $G_{BIC}$ would result in a lower BIC
score. It follows that for sufficiently large $M$, the addition of the monotonicity term will
not change the undirected skeleton or v-structures of $G_{BIC}$. Therefore, $G$ is I-equivalent
to $G_{BIC}$ and by transitivity, $G$ is I-equivalent to $G^*$. □

Theorem 1 (Convergence conditions for Polaris). For a sufficiently large sample size,
$M$, under the assumptions of no transitive edges and faithful temporal priority relations
between nodes and their parents, at least for nodes that have exactly one parent, optimizing
Polaris converges to the exact structure for MPNs.
Proof:



Let $G^* = (V, E^*)$ be the graph that generates the data and $G$ the graph learned by
optimizing the Polaris score. By the Polaris consistency lemma (Lemma 2), for sufficiently
large $M$, the undirected skeleton and v-structures of $G$ are the same as those of $G^*$. Below,
we show that under assumptions of temporal priority for all parent-child relations, $G = G^*$.
We proceed by showing that the parent set of each node is learned correctly, by considering
nodes that have zero parents, one parent, or two or more parents. It then follows that all
of the edges in the undirected skeleton of $G^*$ are oriented correctly and thus $G = G^*$.

Case I: $X_i$ has 0 parents. If $X_i$ has no parents, then the undirected skeleton around
$X_i$ will only include the edges to the children of $X_i$. Thus, the empty parent set is
learned correctly.

Case II: $X_i$ has 1 parent. Let $X_j$ be the parent of $X_i$.

Case IIA: $X_j$ has 0 parents. By definition, $X_j$ has 0 parents and $X_i$ has
exactly 1 parent, $X_j$. Reorienting the edge $X_j \to X_i$ to $X_j \leftarrow X_i$ results
in an I-equivalent graph globally, because the edge is not involved in a
v-structure in either orientation. Thus, the BIC score for both orientations is the
same, and in order for Polaris to correctly choose $X_j \to X_i$ over $X_i \to X_j$,
it must be the case that $\alpha_{X_i \to X_j} < \alpha_{X_j \to X_i}$. In the derivation below, we
show that this condition is equivalent to the condition for temporal priority.
Namely, $\alpha_{X_i \to X_j} < \alpha_{X_j \to X_i}$ is equivalent to $\mathcal{P}(X_i) < \mathcal{P}(X_j)$. To conserve
space, we let $\mathcal{P}(X_i \mid X_j) = \theta^+$ and $\mathcal{P}(X_i \mid \bar{X}_j) = \theta^-$. Also, we use the identity
$\mathcal{P}(X_i) = \mathcal{P}(X_i \mid X_j)\mathcal{P}(X_j) + \mathcal{P}(X_i \mid \bar{X}_j)\mathcal{P}(\bar{X}_j) = \theta^+ \mathcal{P}(X_j) + \theta^- \mathcal{P}(\bar{X}_j)$.
The following statements are all equivalent:



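\[
\begin{aligned}
& \alpha_{X_i \to X_j} < \alpha_{X_j \to X_i} \\
\iff\ & \frac{\mathcal{P}(X_j \mid X_i) - \mathcal{P}(X_j \mid \bar{X}_i)}{\mathcal{P}(X_j \mid X_i) + \mathcal{P}(X_j \mid \bar{X}_i)} < \frac{\theta^+ - \theta^-}{\theta^+ + \theta^-} \\
\iff\ & \frac{\dfrac{\theta^+ \mathcal{P}(X_j)}{\mathcal{P}(X_i)} - \dfrac{(1 - \theta^+)\mathcal{P}(X_j)}{1 - \mathcal{P}(X_i)}}{\dfrac{\theta^+ \mathcal{P}(X_j)}{\mathcal{P}(X_i)} + \dfrac{(1 - \theta^+)\mathcal{P}(X_j)}{1 - \mathcal{P}(X_i)}} < \frac{\theta^+ - \theta^-}{\theta^+ + \theta^-} \\
\iff\ & \frac{\dfrac{\theta^+}{\mathcal{P}(X_i)} - \dfrac{1 - \theta^+}{1 - \mathcal{P}(X_i)}}{\dfrac{\theta^+}{\mathcal{P}(X_i)} + \dfrac{1 - \theta^+}{1 - \mathcal{P}(X_i)}} < \frac{\theta^+ - \theta^-}{\theta^+ + \theta^-} \\
\iff\ & \frac{\theta^+ (1 - \mathcal{P}(X_i)) - (1 - \theta^+)\mathcal{P}(X_i)}{\theta^+ (1 - \mathcal{P}(X_i)) + (1 - \theta^+)\mathcal{P}(X_i)} < \frac{\theta^+ - \theta^-}{\theta^+ + \theta^-} \\
\iff\ & \frac{\theta^+ - \mathcal{P}(X_i)}{\theta^+ - 2\theta^+ \mathcal{P}(X_i) + \mathcal{P}(X_i)} < \frac{\theta^+ - \theta^-}{\theta^+ + \theta^-},
\end{aligned}
\]
which is equivalent to the following inequalities:
\[
\begin{aligned}
& \frac{\theta^+ - \left(\theta^- + \mathcal{P}(X_j)(\theta^+ - \theta^-)\right)}{\theta^+ - 2\theta^+ \left(\theta^- + \mathcal{P}(X_j)(\theta^+ - \theta^-)\right) + \theta^- + \mathcal{P}(X_j)(\theta^+ - \theta^-)} < \frac{\theta^+ - \theta^-}{\theta^+ + \theta^-} \\
\iff\ & \frac{(\theta^+ - \theta^-)(1 - \mathcal{P}(X_j))}{\theta^+ - 2\theta^+\theta^- - 2\theta^+ \mathcal{P}(X_j)(\theta^+ - \theta^-) + \theta^- + \mathcal{P}(X_j)(\theta^+ - \theta^-)} < \frac{\theta^+ - \theta^-}{\theta^+ + \theta^-} \\
\iff\ & \frac{1 - \mathcal{P}(X_j)}{\theta^+ - 2\theta^+\theta^- - 2\theta^+ \mathcal{P}(X_j)(\theta^+ - \theta^-) + \theta^- + \mathcal{P}(X_j)(\theta^+ - \theta^-)} < \frac{1}{\theta^+ + \theta^-},
\end{aligned}
\]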






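thus implying
\[
\begin{aligned}
& \theta^+ - 2\theta^+\theta^- - 2\theta^+ \mathcal{P}(X_j)(\theta^+ - \theta^-) + \theta^- + \mathcal{P}(X_j)(\theta^+ - \theta^-) > (1 - \mathcal{P}(X_j))(\theta^+ + \theta^-) \\
\iff\ & \theta^+ - 2\theta^+\theta^- - 2(\theta^+)^2 \mathcal{P}(X_j) + 2\theta^+\theta^- \mathcal{P}(X_j) + \theta^- + \theta^+ \mathcal{P}(X_j) - \theta^- \mathcal{P}(X_j) \\
& \qquad > \theta^+ + \theta^- - \theta^+ \mathcal{P}(X_j) - \theta^- \mathcal{P}(X_j).
\end{aligned}
\]
Simplifying further (cancelling common terms and dividing by $\theta^+ > 0$), we have
\[
\begin{aligned}
& -2\theta^- - 2\theta^+ \mathcal{P}(X_j) + 2\theta^- \mathcal{P}(X_j) > -2\mathcal{P}(X_j) \\
\iff\ & \theta^- + \theta^+ \mathcal{P}(X_j) - \theta^- \mathcal{P}(X_j) < \mathcal{P}(X_j) \\
\iff\ & \theta^+ \mathcal{P}(X_j) + \theta^- (1 - \mathcal{P}(X_j)) < \mathcal{P}(X_j) \\
\iff\ & \mathcal{P}(X_i) < \mathcal{P}(X_j).
\end{aligned}
\]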
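
The equivalence established in Case IIA can be spot-checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it takes $\alpha = (\theta^+ - \theta^-)/(\theta^+ + \theta^-)$ for each orientation (the form used in the derivation), obtains the reversed conditionals by Bayes' rule, and restricts attention to $\theta^+ > \theta^-$. The function name `alphas` and the grid of test values are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

def alphas(theta_plus, theta_minus, p_j):
    """Return (alpha for X_i -> X_j, alpha for X_j -> X_i, P(X_i))."""
    p_i = theta_plus * p_j + theta_minus * (1 - p_j)
    # Bayes' rule gives the conditionals for the reversed edge.
    rev_plus = theta_plus * p_j / p_i                # P(X_j | X_i)
    rev_minus = (1 - theta_plus) * p_j / (1 - p_i)   # P(X_j | not X_i)
    alpha_ij = (rev_plus - rev_minus) / (rev_plus + rev_minus)
    alpha_ji = (theta_plus - theta_minus) / (theta_plus + theta_minus)
    return alpha_ij, alpha_ji, p_i

grid = [F(n, 20) for n in range(1, 20)]   # 0.05, 0.10, ..., 0.95 (exact)
checked = mismatches = 0
for tp, tm, pj in product(grid, repeat=3):
    if tp <= tm:
        continue                          # keep only theta+ > theta-
    a_ij, a_ji, p_i = alphas(tp, tm, pj)
    checked += 1
    if (a_ij < a_ji) != (p_i < pj):
        mismatches += 1
print(checked, mismatches)
```

Exact rational arithmetic is used so that boundary cases, where $\mathcal{P}(X_i) = \mathcal{P}(X_j)$ and the two $\alpha$ values coincide, are not blurred by floating-point rounding.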

Case IIB: $X_j$ has 1 or more parents. Incorrectly reorienting the edge $X_j \to X_i$
to $X_j \leftarrow X_i$ makes $X_i$ a parent of $X_j$. Because $G^*$ is acyclic and has no
transitive edges, there are no edges between $X_i$ and the true parents of $X_j$.
Thus, making $X_i$ a new parent of $X_j$ creates a new v-structure (Case III
proves that if $X_j$ has 2 or more parents, then they are all unwed), consisting
of $X_i$, $X_j$, and the true parents of $X_j$, that is not in $G^*$. This contradicts the
consistency of Polaris and thus the edge $X_j \to X_i$ will never be reoriented.
Case III: $X_i$ has 2 or more parents. Because $G^*$ has no transitive edges, there
cannot be any edge between any two parents of $X_i$. Thus, the parents of $X_i$ are
unwed and form a v-structure with $X_i$. Because Polaris is consistent, this
v-structure is learned correctly. □
Corollary 1 (Convergence conditions for Polaris with filtering). For a sufficiently large
sample size, $M$, under the assumptions of no transitive edges and faithful temporal priority
relations, filtering with the $\alpha$-filter and then optimizing Polaris converges to the exact
structure for MPNs.
Proof:

In Lemma 1, we showed that $\alpha$-filtering removes no true parent sets. In Theorem 1,
we showed that given a hypothesis space that includes the true parent sets, optimizing
Polaris returns the true graph. Because the $\alpha$-filter does not remove the true parent sets
from the hypothesis space, optimizing Polaris will still return the correct structure on
the filtered hypothesis space. □



